"Influence sketching": Finding influential samples in large-scale regressions

Authors

Michael Wojnowicz

Ben Cruz

Xuan Zhao

Brian Wallace

Matt Wolff

Jay Luan

Caleb Crable

Abstract

There is an especially strong need in modern large-scale data analysis to prioritize samples for manual inspection. For example, the inspection could target important mislabeled samples or key vulnerabilities exploitable by an adversarial attack. To solve the "needle in the haystack" problem of which samples to inspect, we develop a new scalable version of Cook's distance, a classical statistical technique for identifying samples that have an unusually strong impact on the fit of a regression model (and on its downstream predictions). To scale this technique up to very large and high-dimensional datasets, we introduce a new algorithm, which we call "influence sketching." Influence sketching embeds random projections within the influence computation; in particular, the influence score is calculated using the randomly projected pseudo-dataset from the post-convergence Generalized Linear Model (GLM). We validate that influence sketching can reliably discover influential samples by applying the technique to a malware detection dataset of over 2 million executable files, each represented with almost 100,000 features. For example, we find that randomly deleting approximately 10% of training samples reduces predictive accuracy only slightly, from 99.47% to 99.45%, whereas deleting the same number of samples with high influence sketch scores reduces predictive accuracy all the way down to 90.24%. Moreover, we find that influential samples are especially likely to be mislabeled. In the case study, we manually inspect the most influential samples and find that influence sketching pointed us to new, previously unidentified pieces of malware.
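The computation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a logistic regression as the GLM, a Gaussian random projection, and the textbook form of generalized Cook's distance (squared Pearson residual times leverage, divided by the squared complement of the leverage, with the constant factor dropped because only the ranking matters). The function names `sketched_leverage` and `influence_sketch`, the sketch dimension `k`, and the synthetic data are all hypothetical, and the "fitted" probabilities below are a stand-in for the output of a converged model.

```python
import numpy as np


def sketched_leverage(X, p_hat, k, seed=0):
    """Leverage of each sample, computed in a k-dimensional random projection.

    X: (n, d) design matrix; p_hat: (n,) fitted probabilities from a
    converged logistic regression; k: sketch dimension (assumed k << d).
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    v = p_hat * (1.0 - p_hat)                         # IRLS weights at convergence
    omega = rng.standard_normal((d, k)) / np.sqrt(k)  # Gaussian random projection
    Z = np.sqrt(v)[:, None] * (X @ omega)             # projected, reweighted pseudo-dataset
    gram = Z.T @ Z                                    # k x k instead of d x d
    # h_i = z_i^T (Z^T Z)^{-1} z_i, computed row by row without forming an n x n matrix
    return np.einsum("ij,ij->i", Z, np.linalg.solve(gram, Z.T).T)


def influence_sketch(X, y, p_hat, k, seed=0):
    """Approximate generalized Cook's distance, up to a constant factor."""
    h = sketched_leverage(X, p_hat, k, seed)
    pearson = (y - p_hat) / np.sqrt(p_hat * (1.0 - p_hat))  # Pearson residuals
    return pearson**2 * h / (1.0 - h) ** 2


# Synthetic stand-in data (hypothetical; not the malware dataset from the paper).
rng = np.random.default_rng(1)
n, d, k = 200, 50, 10
X = rng.standard_normal((n, d))
beta = rng.standard_normal(d) / np.sqrt(d)
p_hat = 1.0 / (1.0 + np.exp(-(X @ beta)))   # stand-in for fitted probabilities
y = (rng.random(n) < p_hat).astype(float)

scores = influence_sketch(X, y, p_hat, k)
top = np.argsort(scores)[::-1][:5]          # indices of the 5 highest-scoring samples
print(top)
```

The point of the projection is that the matrix to be inverted is k x k rather than d x d, which is what would make the leverage computation feasible when d is close to 100,000. Samples with the highest scores would be the ones to prioritize for manual inspection.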

Related resources

Large Scale Metagenomic Sequence Clustering via Sketching and Maximal Quasi-clique Enumeration on Map-Reduce Clusters

Taxonomic clustering of species from millions of DNA fragments sequenced from their genomes is an important and frequently arising problem in metagenomics. High-throughput next generation sequencing is enabling the creation of large metagenomic samples, while at the same time making the clustering problem harder due to the short sequence length supported and sampling of hitherto unknown species...

Finding influential individual in Social Network graphs using CSCS algorithm and Shapley value in game theory

In recent years, social network analysis has gained a great deal of attention. Social networks have various applications in different areas, namely predicting disease epidemics, search engines, and viral advertisements. A key property of social networks is that interpersonal relationships can influence the decisions that individuals make. Finding the most influential nodes is important in social networks b...

Centrality Measures, Upper Bound, and Influence Maximization in Large Scale Directed Social Networks

The paper addresses the problem of finding top k influential nodes in large scale directed social networks. We propose two new centrality measures, Diffusion Degree for independent cascade model of information diffusion and Maximum Influence Degree. Unlike other existing centrality measures, diffusion degree considers neighbors’ contributions in addition to the degree of a node. The measure als...

Large-eddy simulation of turbulent flow over an array of wall-mounted cubes submerged in an emulated atmospheric boundary-layer

Turbulent flow over an array of wall-mounted cubic obstacles has been numerically investigated using large-eddy simulation. The simulations have been performed using high-performance computations with local cluster systems. The array of cubes is fully submerged in a simulated deep rough-wall atmospheric boundary-layer with high turbulence intensity characteristics of environmental turbulent fl...

Cliques Role in Organizational Reputational Influence: A Social Network Analysis

Empirical support for the assumption that cliques are major determinants of reputational influence derives largely from the frequent finding that organizations which claimed that their cliques’ connections are influential had an increased likelihood of becoming influential themselves. It is suggested that the strong and consistent connection in cliques is at least partially responsible for the ...

Publication date: 2016